A statistical approach to the traceroute-like
exploration of networks: theory and simulations

Luca Dall'Asta, Ignacio Alvarez-Hamelin, Alain Barrat, Alexei Vazquez, Alessandro Vespignani 



Abstract: Mapping the Internet generally consists in sampling the
network from a limited set of sources by using traceroute-like probes.
This methodology, akin to the merging of different spanning trees to a
set of destinations, has been argued to introduce uncontrolled sampling
biases that might produce statistical properties of the sampled graph
which sharply differ from the original ones [7-9]. Here we explore these
biases and provide a statistical analysis of their origin. We derive a
mean-field analytical approximation for the probability of edge and
vertex detection that exploits the role of the number of sources and
targets and allows us to relate the global topological properties of the
underlying network with the statistical accuracy of the sampled graph.
In particular we find that the edge and vertex detection probability
depends on the betweenness centrality of each element. This allows us to
show that shortest path routed sampling provides a better
characterization of underlying graphs with scale-free topology. We
complement the analytical discussion with a thorough numerical
investigation of simulated mapping strategies in different network
models. We show that sampled graphs provide a fair qualitative
characterization of the statistical properties of the original networks
in a fair range of different strategies and exploration parameters. The
numerical study also allows the identification of intervals of the
exploration parameters that optimize the fraction of nodes and edges
discovered in the sampled graph. This finding might hint at steps toward
more efficient mapping strategies.

Keywords: Traceroute, Internet exploration, Topology inference.



I. Introduction 

A significant research and technical challenge in the
study of large information networks is related to the lack
of highly accurate maps providing information on their basic
topology. This is mainly due to the dynamical nature
of their structure and to the lack of any centralized control
resulting in a self-organized growth and evolution of
these systems. A prototypical example of this situation is 
faced in the case of the physical Internet. The topology 
of the Internet can be investigated at different granularity 
levels such as the router and Autonomous System (AS) 
level, with the final aim of obtaining an abstract represen- 
tation where the set of routers or ASs and their physical 
connections (peering relations) are the vertices and edges 
of a graph, respectively. In the absence of accurate maps, 
researchers rely on a general strategy that consists in ac- 
quiring local views of the network from several vantage 
points and merging these views in order to get a presum- 
ably accurate global map. Local views are obtained by 
evaluating a certain number of paths to different destina- 
tions by using specific tools such as traceroute or by 
the analysis of BGP tables. At first approximation these 
processes amount to the collection of shortest paths from 
a source node to a set of target nodes, obtaining a partial 
spanning tree of the network. The merging of several of 
these views provides the map of the Internet from which 
the statistical properties of the network are evaluated. 

By using this strategy a number of research groups have 
generated maps of the Internet [1-5], that have been used 
for the statistical characterization of the network properties.
Defining $\mathcal{G} = (V, E)$ as the sampled graph of the Internet
with $N = |V|$ vertices and $|E|$ edges, it is quite intuitive that the
Internet is a sparse graph in which the number of edges is much lower
than in a complete graph; i.e. $|E| \ll N(N-1)/2$. Equally important is
the fact that the average distance, measured as the shortest path,
between vertices is very small. This is the so-called small-world
property, which is essential for the efficient functioning of the
network. Most surprising is the evidence of a power-law relationship
between the frequency of vertices and their degree $k$, defined as the
number of edges linking each vertex to its neighbors. Namely, the
probability that any vertex in the graph has degree $k$ is well
approximated by $P(k) \sim k^{-\gamma}$ with $2 < \gamma < 2.5$ [6].
Evidence for the heavy-tailed behavior of the degree distribution has
been collected in several other studies at the router and AS level
[10-14] and has generated a large activity in the field of network
modeling and characterization [15-19].

While traceroute-driven strategies are very flexible and can be
feasible for extensive use, the obtained maps are undoubtedly
incomplete. Along with technical problems such as the instability of
paths between routers and interface resolutions [20], typical mapping
projects are run from relatively small sets of sources whose combined
views are missing a considerable number of edges and vertices [14,21].
In particular, the various spanning trees especially miss the lateral
connectivity of targets and sample more frequently nodes and links which
are closer to each source, introducing spurious effects that might
seriously compromise the statistical accuracy of the sampled graph.
These sampling biases have been explored in numerical experiments on
synthetic graphs generated by different algorithms [7-9]. Very
interestingly, it has been shown that apparent degree distributions with
heavy tails may be observed even from regular topologies such as in the
classic Erdos-Renyi graph model [7,8]. These studies thus point out that
the evidence obtained from the analysis of the Internet sampled graphs
might be insufficient to draw conclusions on the topology of the actual
Internet network.

In this work we tackle this problem by performing a mean-field
statistical analysis and extensive numerical experiments of shortest
path routed sampling in different network models. We find an approximate
expression for the probability of edges and vertices to be detected that
makes explicit the dependence upon the number of sources, targets and
the topological properties of the networks. This expression allows the
understanding of the qualitative behavior of the efficiency of the
exploration methods by changing the number of probes imposed on the
graph. Moreover, the analytical study provides a general understanding
of which kind of topologies yields the most accurate sampling. In
particular, we show that the map accuracy depends on the underlying
network betweenness centrality distribution; the broader the
distribution the higher the statistical accuracy of the sampled graph.

We substantiate our analytical findings with a thorough analysis of maps
obtained varying the number of source-target pairs on network models
with different topological properties. The results show that single
source mapping processes face serious limitations in that even the
targeting of the whole network results in a very partial discovery of
its connectivity. On the contrary, the use of multiple sources promptly
leads to a consistent increase in the accuracy of the obtained maps
where the statistical degree distributions are qualitatively
discriminated even at low values of target density. A detailed
discussion of the behavior of the degree distribution and other
statistical quantities as a function of targets and sources is provided
for sampled graphs with different topologies and compared with the
insight obtained by analytical means.

We also inspect quantitatively the portion of discovered network in
different mapping processes imposing the same density of probes on the
network. We find the presence of a region of low efficiency (fewer nodes
and edges discovered) depending on the relative proportion of sources
and targets. Furthermore, the analysis of the optimal range of sources
and targets for the estimate of the network average degree and
clustering indicates a different parameter region. This finding calls
for a "trade-off" between the accuracy in the observation of different
quantities and hints at possible optimization procedures in the
traceroute-driven mapping of large networks.

II. Related work 

A certain number of works have been devoted to the 
study of sampled graphs obtained by shortest path prob- 
ing procedures, and to the assessment of their accuracy. 
We present a short survey of the works which are related 
to ours. 

Work by Lakhina et al. [7] has shown that power-law 
like distributions can be obtained for subgraphs of Erdos- 
Renyi random graphs when the subgraph is the result of 
a traceroute exploration with relatively few sources 
and destinations. They discuss the origin of these biases 
and the effect of the distance between source and target in 
the mapping process. 

In a recent work [8], Clauset and Moore have studied analytically the
single source probing to all possible destinations of an Erdos-Renyi
random graph with average degree $\bar{k}$. In agreement with the
numerical study of Lakhina et al. [7] they have found that the
connectivity distribution of the obtained spanning tree displays a
power-law behavior $k^{-1}$, with an exponential cut-off setting in at a
characteristic degree $k_c \sim \bar{k}$.

In Ref. [9], Petermann and De Los Rios have studied 
a traceroute-like procedure on various examples of 
scale-free graphs, showing that, in the case of a single 
source, power-law distributions with underestimated ex- 
ponents are obtained. Analytical estimates of the mea- 
sured exponents as a function of the true ones were also 
derived. Finally, in a recent preprint that appeared during the
completion of our work, Guillaume and Latapy [23] report on the
shortest-path exploration of synthetic graphs, comparing properties of
the resulting sampled graph with those of the original network. The
exploration is analyzed using level plots for the proportion of
discovered nodes and edges in the graph as a function of the number of
sources and targets, giving also hints for optimal placement of sources
and targets. All these pieces of work make clear the relevance of
determining to what extent the topological properties observed in
sampled graphs are representative of those of the real networks.

Fig. 1
Illustration of the traceroute-like procedure. Shortest paths between
the set of sources and the set of destination targets are discovered
(shown in full lines) while other edges are not found (dashed lines).
Note that not all shortest paths are found since the "Unique Shortest
Path" procedure is used.

III. Modeling the traceroute discovery of unknown networks

In a typical traceroute study, a set of active sources deployed in the
network run traceroute probes to a set of destination nodes. Each probe
collects information on all the nodes and edges traversed along the path
connecting the source to the destination, allowing the discovery of the
network [20]. By merging the information collected on each path it is
then possible to reconstruct a partial map of the network (see Fig. 1).
In more detail, the edges and the nodes discovered by each probe will
depend on the metric $\mathcal{M}$ used to decide the path between a
pair of nodes. While in the Internet many factors, including commercial
agreements and administrative routing policies, contribute to determine
the actual path, it is clear that to a first approximation the route
obtained by traceroute-like probes is the shortest path between the two
nodes. This assumption, however, is not sufficient for a proper
definition of a traceroute model in that equivalent shortest paths
between two nodes may exist. In the presence of a degeneracy of shortest
paths we must therefore specify the metric $\mathcal{M}$ by providing a
resolution algorithm for the selection of shortest paths.

For the sake of simplicity we can define three selection mechanisms
defining different $\mathcal{M}$-paths that may account for some of the
features encountered in Internet discovery:

• Unique Shortest Path (USP) probe. In this case the shortest path route
selected between a node i and the destination target T is always the
same independently of the source S (the path being initially chosen at
random among all the equivalent ones).

• Random Shortest Path (RSP) probe. The shortest path between any
source-destination pair is chosen randomly among the set of equivalent
shortest paths. This might mimic different peering agreements that make
the paths among pairs of nodes independent.

• All Shortest Paths (ASP) probe. The metric discovers all the
equivalent shortest paths between source-destination pairs. This might
happen in the case of probing repeated in time (long time exploration),
so that back-up paths and equivalent paths are discovered in different
runs.

Actual traceroute probes contain a mixture of the three mechanisms
defined above. We do not attempt, however, to account for all the
subtleties that real studies encounter, e.g. IP routing, BGP policies,
interface resolutions and many others. Each traceroute probe provides a
test of the possible biases and we will see that the different metrics
have only little influence on the general picture emerging from our
results. On the other hand, it is intuitive to recognize that the USP
metric represents the worst case scenario since, among the three
different methods, it yields the minimum number of discoveries. For this
reason, if not otherwise specified, we will report the USP data to
illustrate the general features of our synthetic exploration.

More formally, the experimental setup for our simulated traceroute
mapping is the following. Let $G = (V, E)$ be a sparse undirected graph
with vertices (nodes) $V = \{1, 2, \dots, N\}$ and edges (links)
$E = \{(i, j)\}$. Then let us define the sets of vertices
$\mathcal{S} = \{i_1, i_2, \dots, i_{N_S}\}$ and
$\mathcal{T} = \{j_1, j_2, \dots, j_{N_T}\}$ specifying the random
placement of $N_S$ sources and $N_T$ destination targets. For each
ensemble of source-target pairs $\Omega = \{\mathcal{S}, \mathcal{T}\}$,
we compute with the metric $\mathcal{M}$ the path connecting each
source-target pair. The sampled graph $\mathcal{G} = (V^*, E^*)$ is
defined as the set of vertices $V^*$ (with $N^* = |V^*|$) and edges
$E^*$ induced by considering the union of all the $\mathcal{M}$-paths
connecting the source-target pairs. The sampled graph is thus analogous
to the maps obtained from real traceroute sampling of the Internet.
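
The setup just described can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' simulation code: the names `bfs_tree` and `usp_sample` and the adjacency-dictionary representation are hypothetical choices made here. The USP rule is implemented by building, for each target, a single breadth-first tree whose ties are broken at random (not necessarily uniformly over the equivalent shortest paths), so that the route from any node to that target never depends on the source.

```python
import random
from collections import deque


def bfs_tree(adj, root, rng):
    """Return {node: next hop towards root} for one shortest-path tree."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        neighbours = list(adj[u])
        rng.shuffle(neighbours)  # random choice among equivalent paths
        for v in neighbours:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent


def usp_sample(adj, sources, targets, seed=0):
    """Union of the USP routes from every source to every target.

    Returns the discovered vertex set V* and edge set E* (edges stored
    as frozensets because the graph is undirected).
    """
    rng = random.Random(seed)
    seen_nodes, seen_edges = set(), set()
    for t in targets:
        parent = bfs_tree(adj, t, rng)  # one fixed tree per target
        for s in sources:
            if s not in parent:  # target unreachable from this source
                continue
            u = s
            seen_nodes.add(u)
            while parent[u] is not None:
                v = parent[u]
                seen_edges.add(frozenset((u, v)))
                seen_nodes.add(v)
                u = v
    return seen_nodes, seen_edges


# Toy graph: a ring of 6 vertices, probed from one source to all vertices.
ring = {i: {(i - 1) % 6, (i + 1) % 6} for i in range(6)}
nodes, edges = usp_sample(ring, sources=[0], targets=range(6))
print(len(nodes), "vertices and", len(edges), "edges discovered")
```

On this ring a single source reaches every vertex but always leaves one edge undiscovered, a small-scale instance of the missing lateral connectivity discussed in the Introduction.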

In our study the parameters of interest are the densities
$\rho_T = N_T/N$ and $\rho_S = N_S/N$ of targets and sources. In
general, traceroute-driven studies run from a relatively small number of
sources to a much larger set of destinations. For this reason, in many
cases it is appropriate to work with the density of targets $\rho_T$
while still considering $N_S$ instead of the corresponding density.
Indeed, it is clear that while 100 targets may represent a fair probing
of a network composed of 500 nodes, this number would be clearly
inadequate in a network of 10^ nodes. On the contrary, the density of
targets $\rho_T$ allows us to compare mapping processes on networks with
different sizes by defining an intrinsic percentage of targeted
vertices. In many cases, as we will see in the next sections, an
appropriate quantity representing the level of sampling of the networks
is

$$ \epsilon = \frac{N_S N_T}{N} = \rho_T N_S, \qquad (1) $$

that measures the density of probes imposed on the system. In real
situations it represents the density of traceroute probes in the network
and therefore a measure of the load placed on the network by the
measuring infrastructure.
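
As a quick numerical illustration of these definitions, the helper below (a hypothetical function name, introduced here only for the example) returns the target density, the source density and the probing density N_S N_T / N. The numbers used are the 100 targets on 500 nodes mentioned above, combined with an assumed choice of two sources.

```python
def probing_density(n_nodes, n_sources, n_targets):
    """Return (rho_T, rho_S, epsilon) for a given exploration setup."""
    rho_t = n_targets / n_nodes
    rho_s = n_sources / n_nodes
    epsilon = n_sources * n_targets / n_nodes  # equals rho_T * N_S
    return rho_t, rho_s, epsilon


# 100 targets on a 500-node network, probed from 2 sources (assumed value).
rho_t, rho_s, epsilon = probing_density(500, 2, 100)
print(rho_t, rho_s, epsilon)
```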

In the following, our aim is to evaluate to what extent the statistical
properties of the sampled graph $\mathcal{G}$ depend on the parameters
of our experimental setup and are representative of the properties of
the underlying graph $G$.

IV. Mean-field theory of the discovery bias 

We begin our study by presenting a mean-field statistical analysis of
the simulated traceroute mapping. Our aim is to provide a statistical
estimate for the probability of edge and node detection as a function of
$N_S$, $N_T$ and the topology of the underlying graph.

For each set $\Omega = \{\mathcal{S}, \mathcal{T}\}$ we can define the
quantities

$$ \sum_{t=1}^{N_T} \delta_{i,j_t} = \begin{cases} 1 & \text{if vertex } i \text{ is a target;} \\ 0 & \text{otherwise,} \end{cases} \qquad (2) $$

$$ \sum_{s=1}^{N_S} \delta_{i,i_s} = \begin{cases} 1 & \text{if vertex } i \text{ is a source;} \\ 0 & \text{otherwise,} \end{cases} \qquad (3) $$

where $\delta_{i,j}$ is the Kronecker symbol. These quantities tell us
if any given node $i$ belongs to the set of sources or targets, and obey
the sum rules $\sum_i \sum_{t=1}^{N_T} \delta_{i,j_t} = N_T$ and
$\sum_i \sum_{s=1}^{N_S} \delta_{i,i_s} = N_S$. Analogously, we define
the quantity $\sigma_{i,j}^{(l,m)}$ that assumes the value 1 if the edge
$(i, j)$ belongs to the $\mathcal{M}$-path between nodes $l$ and $m$,
and 0 otherwise. By using the above definitions, the indicator function
that a given edge $(i, j)$ will be discovered and belongs to the sampled
graph is given by

$$ \pi_{i,j} = 1 - \prod_{l \neq m} \left( 1 - \sum_{s=1}^{N_S} \delta_{l,i_s} \sum_{t=1}^{N_T} \delta_{m,j_t} \, \sigma_{i,j}^{(l,m)} \right). \qquad (4) $$

In the case of a given set $\Omega = \{\mathcal{S}, \mathcal{T}\}$, the
discovery indicator function is simply $\pi_{i,j} = 1$ if the edge
$(i, j)$ belongs to at least one of the $\mathcal{M}$-paths connecting
the source-target pairs, and 0 otherwise. While the above exact
expression does not lead us too far in the understanding of the
discovery probabilities, it is interesting to look at the process on a
statistical ground by studying the average over all possible
realizations of the set $\Omega = \{\mathcal{S}, \mathcal{T}\}$. By
definition we have that

$$ \left\langle \sum_{t=1}^{N_T} \delta_{i,j_t} \right\rangle = \rho_T \quad \text{and} \quad \left\langle \sum_{s=1}^{N_S} \delta_{i,i_s} \right\rangle = \rho_S, \qquad (5) $$



where $\langle \cdots \rangle$ identifies the average over all possible
deployments of sources and targets $\Omega$. These equalities simply
state that each node $i$ has, on average, a probability to be a source
or a target that is proportional to their respective densities. In the
following, we will make use of an uncorrelation assumption that allows
an explicit approximation for the discovery probability. The assumption
consists in neglecting correlations originated by the position of
sources and targets on the discovery probability by different paths.
While this assumption does not provide an exact treatment for the
problem it generally conveys a qualitative understanding of the
statistical properties of the system. In this approximation, the average
discovery probability of an edge is

$$ \langle \pi_{i,j} \rangle = 1 - \left\langle \prod_{l \neq m} \left( 1 - \sum_{s=1}^{N_S} \delta_{l,i_s} \sum_{t=1}^{N_T} \delta_{m,j_t} \, \sigma_{i,j}^{(l,m)} \right) \right\rangle \simeq 1 - \prod_{l \neq m} \left( 1 - \rho_T \rho_S \langle \sigma_{i,j}^{(l,m)} \rangle \right), \qquad (6) $$

where in the last term we take advantage of neglecting correlations by
replacing the average of the product of variables with the product of
the averages and using Eq. (5). This expression simply states that each
possible source-target pair weights in the average with the product of
the probability that the end nodes are a source and a target; the
discovery probability is thus obtained by considering the edge in an
average effective medium (mean-field) of sources and targets
homogeneously distributed in the network. This approach is indeed akin
to mean-field methods customarily used in the study of many particle
systems where each particle is considered in an effective average medium
defined by the uncorrelated averages of quantities. The realization
average $\langle \sigma_{i,j}^{(l,m)} \rangle$ is very simple in the
uncorrelated picture, depending only on the kind of $\mathcal{M}$-path.
In the case of the ASP probing we have
$\langle \sigma_{i,j}^{(l,m)} \rangle = \sigma_{i,j}^{(l,m)}$, since
each path contributes to the discovery of the edge. In the case of the
USP and the RSP, however, only one path among all the equivalent ones is
chosen, and in the average we have that each shortest path gives a
contribution $\sigma_{i,j}^{(l,m)} / \sigma^{(l,m)}$ to
$\langle \sigma_{i,j}^{(l,m)} \rangle$, where $\sigma^{(l,m)}$ is the
number of equivalent shortest paths between vertices $l$ and $m$.

The standard situation we consider is the one in which
$\rho_T \rho_S \ll 1$; since
$\langle \sigma_{i,j}^{(l,m)} \rangle \le 1$, we have

$$ \prod_{l \neq m} \left( 1 - \rho_T \rho_S \langle \sigma_{i,j}^{(l,m)} \rangle \right) \simeq \prod_{l \neq m} \exp\left( -\rho_T \rho_S \langle \sigma_{i,j}^{(l,m)} \rangle \right), \qquad (7) $$

that inserted in Eq. (6) yields

$$ \langle \pi_{i,j} \rangle \simeq 1 - \prod_{l \neq m} \exp\left( -\rho_T \rho_S \langle \sigma_{i,j}^{(l,m)} \rangle \right) = 1 - \exp\left( -\rho_T \rho_S b_{ij} \right), \qquad (8) $$

where $b_{ij} = \sum_{l \neq m} \langle \sigma_{i,j}^{(l,m)} \rangle$.
In the case of the USP and RSP probing, the quantity $b_{ij}$ is by
definition the edge betweenness centrality [24, 25], sometimes also
referred to as "load" [26] (in the case of ASP probing, it is a closely
related quantity). Indeed the vertex or edge betweenness is defined as
the total number of shortest paths among pairs of vertices in the
network that pass through a vertex or an edge, respectively. If there
are multiple shortest paths between a pair of vertices, the path
contributes to the betweenness with the corresponding relative weight.
The betweenness gives a measure of the amount of all-to-all traffic that
goes through an edge or vertex, if the shortest path is used as the
metric defining the optimal path between pairs of vertices, and it can
be considered as a non-local measure of the centrality of an edge or
vertex in the graph.
The edge betweenness assumes values between 2 and $N(N-1)$ and the
discovery probability of the edge will therefore depend strongly on its
betweenness. In particular, for edges with minimum betweenness
$b_{ij} = 2$ we have

$$ \langle \pi_{i,j} \rangle \simeq 2 \rho_T \rho_S, \qquad (9) $$

that recovers the probability that the two end vertices of the edge are
chosen as source and target. This implies that if the densities of
sources and targets are small but finite in the limit of very large $N$,
all the edges in the underlying graph have an appreciable probability to
be discovered. Moreover, for a large majority of edges with high
betweenness the discovery probability approaches one and we can
reasonably expect to have a fair sampling of the network. In most
realistic samplings, however, we face a very different situation. While
it is reasonable to consider $\rho_T$ a small but finite value, the
number of sources is not extensive ($N_S \sim \mathcal{O}(1)$) and their
density tends to zero as $N^{-1}$. In this case it is more convenient to
express the edge discovery probability as

$$ \langle \pi_{i,j} \rangle \simeq 1 - \exp\left( -\epsilon \tilde{b}_{ij} \right), \qquad (10) $$

where $\epsilon = \rho_T N_S$ is the density of probes imposed on the
system and the rescaled betweenness $\tilde{b}_{ij} = N^{-1} b_{ij}$ is
now limited in the interval $[2N^{-1}, N-1]$. In the limit of large
networks $N \to \infty$ it is clear that edges with low betweenness have
$\langle \pi_{i,j} \rangle \sim \mathcal{O}(N^{-1})$, for any finite
value of $\epsilon$. This readily tells us that in real situations the
discovery process is generally not complete, a large part of low
betweenness edges being not discovered, and that the network sampling is
made progressively more accurate by increasing the density of probes
$\epsilon$.
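
The mean-field edge discovery probability is easy to evaluate numerically. The sketch below uses hypothetical function names and illustrative parameter values chosen here; it compares the product form 1 - prod(1 - rho_T rho_S <sigma>) with its exponential approximation 1 - exp(-rho_T rho_S b_ij), and checks that the latter coincides with the rescaled form 1 - exp(-epsilon b_ij / N).

```python
import math


def edge_discovery_product(rho_t, rho_s, contributions):
    """Product form: `contributions` lists <sigma> for each pair (l, m)."""
    prod = 1.0
    for c in contributions:
        prod *= 1.0 - rho_t * rho_s * c
    return 1.0 - prod


def edge_discovery_mean_field(rho_t, rho_s, b_ij):
    """Exponential approximation in terms of the edge betweenness b_ij."""
    return 1.0 - math.exp(-rho_t * rho_s * b_ij)


def edge_discovery_rescaled(epsilon, b_ij, n_nodes):
    """Same quantity written with epsilon and the rescaled betweenness."""
    return 1.0 - math.exp(-epsilon * b_ij / n_nodes)


# Illustrative values: N = 1000 nodes, N_S = 2 sources, N_T = 100 targets.
n_nodes, rho_t, rho_s = 1000, 0.1, 0.002
epsilon = rho_t * rho_s * n_nodes  # rho_T * N_S

for b_ij in (2, 1000, 100000):
    print(b_ij,
          edge_discovery_mean_field(rho_t, rho_s, b_ij),
          edge_discovery_rescaled(epsilon, b_ij, n_nodes))
```

For the minimum betweenness b_ij = 2 the result is close to 2 rho_T rho_S, and it grows towards one only for edges whose betweenness is of order N / epsilon or larger.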

A similar analysis can be performed for the discovery indicator function
$\pi_i$ of a vertex $i$. For each source-target set $\Omega$ we have
that

$$ \pi_i = 1 - \left( 1 - \sum_{s=1}^{N_S} \delta_{i,i_s} - \sum_{t=1}^{N_T} \delta_{i,j_t} \right) \prod_{l \neq m \neq i} \left( 1 - \sum_{s=1}^{N_S} \delta_{l,i_s} \sum_{t=1}^{N_T} \delta_{m,j_t} \, \sigma_i^{(l,m)} \right), \qquad (11) $$

where $\sigma_i^{(l,m)} = 1$ if the vertex $i$ belongs to the
$\mathcal{M}$-path between nodes $l$ and $m$, and 0 otherwise. This time
it has been considered that each vertex is discovered with probability
one also if it is in the set of sources and targets. The second term on
the right hand side therefore expresses the probability that the vertex
$i$ does not belong to the set of sources and targets and it is not
discovered by any $\mathcal{M}$-path between source-target pairs. By
using the same mean-field approximation as previously, the average
vertex discovery probability reads as

$$ \langle \pi_i \rangle \simeq 1 - (1 - \rho_S - \rho_T) \prod_{l \neq m \neq i} \left( 1 - \rho_T \rho_S \langle \sigma_i^{(l,m)} \rangle \right). \qquad (12) $$
As for the case of the edge discovery probability, the average considers
all possible source-target pairs weighted with probability
$\rho_T \rho_S$. Also in this case, each shortest path gives a
contribution $\sigma_i^{(l,m)} / \sigma^{(l,m)}$ to
$\langle \sigma_i^{(l,m)} \rangle$ for the USP and RSP models, while
$\langle \sigma_i^{(l,m)} \rangle = \sigma_i^{(l,m)}$ for the ASP model.
If $\rho_T \rho_S \ll 1$, by using the same approximations used to
obtain Eq. (8) we obtain

$$ \langle \pi_i \rangle \simeq 1 - (1 - \rho_S - \rho_T) \exp\left( -\rho_T \rho_S b_i \right), \qquad (13) $$

where $b_i = \sum_{l \neq m \neq i} \langle \sigma_i^{(l,m)} \rangle$.
For the USP and RSP we have that $b_i$ is the vertex betweenness
centrality that is limited in the interval $[0, N(N-1)]$ [24-26]. The
betweenness value $b_i = 0$ holds for the leaves of the graph, i.e.
vertices with a single edge, for which we recover
$\langle \pi_i \rangle \simeq \rho_S + \rho_T$. Indeed, vertices of this
kind are dangling ends discovered only if they are either a source or a
target themselves.



As discussed before, the most usual setup corresponds to a density
$\rho_S \sim \mathcal{O}(N^{-1})$ and in the large $N$ limit we can
conveniently write

$$ \langle \pi_i \rangle \simeq 1 - (1 - \rho_T) \exp\left( -\epsilon \tilde{b}_i \right), \qquad (14) $$

where we have neglected terms of order $\mathcal{O}(N^{-1})$ and the
rescaled betweenness $\tilde{b}_i = N^{-1} b_i$ is now defined in the
interval $[0, N-1]$. This expression points out that the probability of
vertex discovery is favored by the deployment of a finite density of
targets that defines its lower bound.
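
The two vertex expressions can be evaluated in the same way. The functions below are a sketch with hypothetical names: the first implements 1 - (1 - rho_S - rho_T) exp(-rho_T rho_S b_i), the second its large-N form 1 - (1 - rho_T) exp(-epsilon b_i / N). The parameter values are illustrative assumptions.

```python
import math


def vertex_discovery(rho_t, rho_s, b_i):
    """Mean-field vertex discovery probability for betweenness b_i."""
    return 1.0 - (1.0 - rho_s - rho_t) * math.exp(-rho_t * rho_s * b_i)


def vertex_discovery_large_n(epsilon, rho_t, b_rescaled):
    """Large-N form, with b_rescaled = b_i / N and O(1/N) terms dropped."""
    return 1.0 - (1.0 - rho_t) * math.exp(-epsilon * b_rescaled)


rho_t, rho_s = 0.1, 0.002

# A leaf (b_i = 0) is found only if it is itself a source or a target.
print(vertex_discovery(rho_t, rho_s, 0))

# The target density acts as a lower bound in the large-N form.
for b_rescaled in (0.0, 1.0, 10.0, 100.0):
    print(b_rescaled, vertex_discovery_large_n(0.2, rho_t, b_rescaled))
```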

We can also provide a simple approximation for the effective average
degree $\langle k_i^* \rangle$ of the node $i$ discovered by our
sampling process. Each edge departing from the vertex will contribute
proportionally to its discovery probability, yielding

$$ \langle k_i^* \rangle = \sum_j \left( 1 - \exp\left( -\epsilon \tilde{b}_{ij} \right) \right) \simeq \epsilon \sum_j \tilde{b}_{ij}. \qquad (15) $$

The final expression is obtained for edges with
$\epsilon \tilde{b}_{ij} \ll 1$. In this case, the sum over all
neighbors of the edge betweenness is simply related to the vertex
betweenness as $\sum_j b_{ij} = 2(b_i + N - 1)$, where the factor 2
considers that each vertex path traverses two edges and the term $N - 1$
accounts for all the edge paths for which the vertex is an endpoint.
This finally yields

$$ \langle k_i^* \rangle \simeq 2\epsilon + 2\epsilon \tilde{b}_i. \qquad (16) $$
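
The relation between edge and vertex betweenness used in this step, sum_j b_ij = 2(b_i + N - 1), can be checked by brute force on a small connected graph. The sketch below is an illustration with hypothetical function names; it counts shortest paths over ordered pairs (l, m), weighting each path by the inverse of the number of equivalent shortest paths, which is the convention under which the minimum edge betweenness is 2.

```python
from collections import deque


def bfs_counts(adj, root):
    """Distances from root and number of shortest paths to each node."""
    dist, sigma = {root: 0}, {root: 1}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Vertex and edge betweenness over ordered pairs (connected graph)."""
    nodes = list(adj)
    info = {r: bfs_counts(adj, r) for r in nodes}
    b_node = {i: 0.0 for i in nodes}
    b_edge = {}
    for l in nodes:
        dl, sl = info[l]
        for m in nodes:
            if m == l:
                continue
            dm, sm = info[m]
            total = sl[m]  # number of equivalent shortest paths l -> m
            for i in nodes:
                # i strictly inside a shortest path from l to m
                if i != l and i != m and dl[i] + dm[i] == dl[m]:
                    b_node[i] += sl[i] * sm[i] / total
                for j in adj[i]:
                    # edge traversed in the direction i -> j
                    if dl[i] + 1 + dm[j] == dl[m]:
                        key = frozenset((i, j))
                        b_edge[key] = b_edge.get(key, 0.0) + sl[i] * sm[j] / total
    return b_node, b_edge


# A 4-cycle (0-1-2-3) with a pendant vertex 4 attached to vertex 3.
graph = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2, 4}, 4: {3}}
b_node, b_edge = betweenness(graph)
n = len(graph)
for i in graph:
    lhs = sum(b_edge[frozenset((i, j))] for j in graph[i])
    print(i, lhs, 2 * (b_node[i] + n - 1))
```

The two printed columns agree for every vertex. The cost grows roughly as N^2 (N + |E|), so this check is only meant for toy graphs.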



The present analysis shows that the measured quantities and statistical
properties of the sampled graph strongly depend on the parameters of the
experimental setup and the topology of the underlying graph. The latter
dependence is made explicit by the key role played by edge and vertex
betweenness in the expressions characterizing the graph discovery. The
betweenness is a nonlocal topological quantity whose properties change
considerably depending on the kind of graph considered. This allows an
intuitive understanding of the fact that graphs with diverse topological
properties deliver different answers to sampling experiments.

V. Numerical exploration of graphs 

In this section we use the analytical results as a guid- 
ance in the discussion of extensive numerical simulations 
of sampling experiments in a wide range of underlying 
graphs derived from different models. 

A. Graph models definition 

In the following, we will analyze sparse undirected graphs denoted by
$G = (V, E)$ where the topological properties of a graph are fully
encoded in its adjacency matrix $a_{ij}$, whose elements are 1 if the
edge $(i, j)$ exists, and 0 otherwise.

In particular we will consider two main classes of graphs: i)
Homogeneous graphs in which, for large degree $k$, the degree
distribution $P(k)$ decays exponentially or faster; ii) Scale-free
graphs for which $P(k)$ has a heavy tail decaying as a power-law
$P(k) \sim k^{-\gamma}$. Here the homogeneity refers to the existence of
a meaningful characteristic average degree that represents the typical
value in the graph. Indeed, in graphs with Poisson-like degree
distribution a vast majority of vertices have degree close to the
average value and deviations from the average are exponentially small in
number. On the contrary, scale-free graphs are very heterogeneous with
very large fluctuations of the degree, characterized by a variance of
the degree distribution which diverges with the size of the network.

Another important characteristic discriminating the topology of graphs
is the clustering coefficient

$$ c_i = \frac{1}{k_i (k_i - 1)} \sum_{j,h} a_{ij} a_{ih} a_{jh}, $$

that measures the local cohesiveness of nodes. It indeed gives the
fraction of connected neighbors of a given node $i$. The average
clustering coefficient $C = \frac{1}{N} \sum_i c_i$ provides an
indication of the global level of cohesiveness of the graph. The number
is generally very small in random graphs that lack correlations. In many
real graphs the clustering coefficient appears to be very high and
suitable models have been formulated to represent this property.
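
The clustering coefficient can be computed directly from an adjacency structure. The sketch below uses hypothetical function names and an adjacency dictionary of neighbor sets; it counts the linked pairs of neighbors of a node and divides by the number of possible pairs. Assigning zero clustering to vertices of degree smaller than 2 is an assumption made here for the example.

```python
def clustering(adj, i):
    """Fraction of pairs of neighbours of i that are themselves linked."""
    neighbours = list(adj[i])
    k = len(neighbours)
    if k < 2:
        return 0.0  # convention assumed for degree-0 and degree-1 vertices
    links = sum(
        1
        for a in range(k)
        for b in range(a + 1, k)
        if neighbours[b] in adj[neighbours[a]]
    )
    return 2.0 * links / (k * (k - 1))


def average_clustering(adj):
    """Average of the local clustering coefficients over all vertices."""
    return sum(clustering(adj, i) for i in adj) / len(adj)


# A triangle (0, 1, 2) with a pendant vertex 3 attached to vertex 0.
graph = {0: {1, 2, 3}, 1: {0, 2}, 2: {0, 1}, 3: {0}}
print([clustering(graph, i) for i in graph], average_clustering(graph))
```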

In the following we will make use of those models that 
can be considered prototypical examples of the various 
classes. 

A.1 Erdos-Renyi model

The classical Erdos-Renyi (ER) model [22] for random graphs $G_{N,p}$
consists of $N$ nodes, each edge being present in $E$ independently with
probability $p$. The expected number of edges is therefore
$|E| = pN(N-1)/2$. In order to have sparse graphs one thus needs to have
$p$ of order $1/N$, since the average degree is $p(N-1)$. Erdos-Renyi
graphs are a typical example of homogeneous graphs, with degree
distribution following a Poisson law, and very small clustering
coefficient (of order $1/N$). Since $G_{N,p}$ can consist of more than
one connected component, we consider only the largest of these
components.
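
A direct, if quadratic, way to generate such graphs is sketched below. The function names, the adjacency-dictionary representation and the parameter values are assumptions made for illustration: every pair of vertices is linked independently with probability p, and the largest connected component is then extracted by breadth-first search.

```python
import random
from collections import deque


def erdos_renyi(n, p, seed=0):
    """G(N, p): each of the N(N-1)/2 possible edges is present with prob. p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def largest_component(adj):
    """Subgraph induced by the largest connected component."""
    seen, best = set(), set()
    for start in adj:
        if start in seen:
            continue
        component = {start}
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in component:
                    component.add(v)
                    queue.append(v)
        seen |= component
        if len(component) > len(best):
            best = component
    return {u: adj[u] & best for u in best}


# Sparse example: average degree p(N - 1) = 6 on N = 300 vertices.
graph = largest_component(erdos_renyi(300, 6 / 299, seed=1))
mean_degree = sum(len(nb) for nb in graph.values()) / len(graph)
print(len(graph), "vertices in the largest component, mean degree", mean_degree)
```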

A.2 Watts-Strogatz model

The construction algorithm proposed by Watts and Strogatz for
small-world networks [27] is the following: the graph is initially a
one-dimensional lattice of length $N$, with periodic boundary conditions
(i.e. a ring), each vertex being connected to its $2m$ nearest neighbors
(with $m > 1$). The vertices are then visited one after the other; each
link connecting a vertex to one of its $m$ nearest neighbors in the
clockwise sense is left in place with probability $1 - p$, and with
probability $p$ is reconnected to a randomly chosen other vertex. Long
range connections are therefore introduced. The number of edges is
$|E| = Nm$, independently of $p$. The degree distribution has a shape
similar to the case of Erdos-Renyi graphs, peaked around the average
value. The clustering coefficient, however, is large if $p \ll 1$,
making this network a typical example of a homogeneous but clustered
network. As for the ER case, it is possible to obtain graphs consisting
of more than one connected component; in this case we use the largest of
these components.
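
The rewiring procedure can be sketched as follows. This is an illustrative implementation with a hypothetical function name, not the original code: it assumes N > 2m, and it forbids self-loops and duplicate edges when choosing the new endpoint, which is one way to keep the number of edges equal to Nm for every value of p.

```python
import random


def watts_strogatz(n, m, p, seed=0):
    """Ring of n vertices, 2m nearest neighbours each, rewired with prob. p."""
    rng = random.Random(seed)
    adj = {i: set() for i in range(n)}
    # Regular ring lattice: link every vertex to its m clockwise neighbours.
    for i in range(n):
        for d in range(1, m + 1):
            j = (i + d) % n
            adj[i].add(j)
            adj[j].add(i)
    # Visit the vertices in turn and rewire their clockwise links.
    for i in range(n):
        for d in range(1, m + 1):
            if rng.random() >= p:
                continue  # link left in place with probability 1 - p
            j = (i + d) % n
            candidates = [v for v in range(n) if v != i and v not in adj[i]]
            if not candidates:
                continue
            new = rng.choice(candidates)
            adj[i].discard(j)
            adj[j].discard(i)
            adj[i].add(new)
            adj[new].add(i)
    return adj


graph = watts_strogatz(50, 3, 0.1, seed=2)
print(sum(len(nb) for nb in graph.values()) // 2, "edges")
```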

A.3 The Barabasi-Albert model

Albert and Barabasi have proposed to combine two ingredients to obtain a
heterogeneous scale-free graph [28]: i) Growth: Starting from an initial
seed of $N_0$ vertices connected by links, a new vertex $n$ is added at
each time step. This new site is connected to $m$ previously existing
vertices; ii) Preferential attachment rule: a node $i$ is chosen by $n$
according to the probability $P_{n \to i} = k_i / \sum_j k_j$, i.e. with
a probability proportional to its degree. After $m$ different vertices
have been chosen to be connected to $n$, the growth process is iterated
by introducing a new vertex, i.e. going back to step i) until the
desired size of the network is reached.

This mechanism yields a connected graph of $|V| = N$ nodes with
$|E| = mN$ edges. Graphs constructed with these rules have two important
characteristics: the degree distribution of the nodes follows a
power-law $P(k) \sim k^{-\gamma}$ with $\gamma = 3$, and the clustering
coefficient is small.
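
A common way to implement preferential attachment is to keep a list in which every vertex appears once per incident edge, so that a uniform draw from the list selects a vertex with probability proportional to its degree. The sketch below follows this idea; the function name is hypothetical, and the seed (a complete graph on m + 1 vertices) is an assumption made here, since the text only specifies a seed of N_0 linked vertices.

```python
import random


def barabasi_albert(n, m, seed=0):
    """Growing graph with preferential attachment, m links per new vertex."""
    rng = random.Random(seed)
    n0 = m + 1  # assumed seed: complete graph on m + 1 vertices
    adj = {i: set(range(n0)) - {i} for i in range(n0)}
    # Each vertex appears in `stubs` once per incident edge (its degree).
    stubs = [i for i in range(n0) for _ in range(m)]
    for new in range(n0, n):
        chosen = set()
        while len(chosen) < m:  # m different existing vertices
            chosen.add(rng.choice(stubs))
        adj[new] = set(chosen)
        for target in chosen:
            adj[target].add(new)
            stubs.extend((target, new))
    return adj


graph = barabasi_albert(200, 3, seed=3)
degrees = sorted(len(nb) for nb in graph.values())
print("edges:", sum(degrees) // 2, "max degree:", degrees[-1])
```

With this seed the number of edges is m(m + 1)/2 + m(N - m - 1), which approaches mN for large N.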

A.4 Clustered and scale-free graph model 

Dorogovtsev, Mendes and Samukhin have introduced in Ref. [29] a model of
growing network with very large clustering coefficient $C$: at each time
step, a new node is introduced and connected to the two extremities of a
randomly chosen edge, thus forming a triangle. A given node is thus chosen
with a probability proportional to its degree, which corresponds to the
preferential attachment rule. The graphs thus obtained have $N$ nodes, $2N$
edges, and a large clustering coefficient ($\approx 0.74$) along with a
power-law degree distribution of the nodes.
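The edge-based growth rule can be sketched as follows. This is a hedged illustration only: the function name `dms_graph` and the choice of a single triangle as the starting graph are assumptions, since the text does not specify the initial condition.

```python
import random

def dms_graph(n_nodes, seed=None):
    """Grow a DMS-like graph; return the list of undirected edges."""
    rng = random.Random(seed)
    # Assumption: the growth starts from a single triangle.
    edges = [(0, 1), (0, 2), (1, 2)]
    for new in range(3, n_nodes):
        # A uniformly chosen edge reaches a given vertex with probability
        # proportional to its degree (preferential attachment).
        a, b = rng.choice(edges)
        # Linking the new node to both extremities closes a triangle.
        edges.append((a, new))
        edges.append((b, new))
    return edges
```

Starting from a triangle, the graph has 2N - 3 edges, i.e. about 2N for large N and an average degree close to 4.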

The main properties of the various graphs are summarized in Table I. Note
that the clustering is indeed large for the WS and the DMS models, and that,
for the scale-free models (BA and DMS), the maximum value of the
connectivity ($k_{max}$) is much larger than the average $\bar{k}$.

Fig. 2
Cumulative distribution of the average node betweenness $b$ in the ER and WS
graph models ($\bar{k} = 20$). The inset (in lin-lin scale) shows the
behavior of the average node betweenness as a function of the degree $k$.

B. Sampling homogeneous graphs 

Our first set of experiments considers underlying graphs with homogeneous
connectivity, namely the Erdős-Rényi (ER) and the Watts-Strogatz (WS)
models. We have used networks with $N = 10^4$ nodes and $\bar{k} = 20$ unless
otherwise specified; for the WS model, $p = 0.1$ has been taken. We have
averaged each measurement over 10 realizations. Both models have a
Poissonian degree distribution with exponentially decaying tails. The
distribution is therefore peaked around the average degree $\bar{k}$, which
represents the typical degree of a node. Since the topological property
governing the traceroute exploration is the betweenness centrality, it is
worth reviewing its general properties
TABLE I
Main characteristics of the graphs used in the numerical exploration.

$C$: 0.01, 0.52, 0.006, 0.74
$k_{max}$: 40, 140, 26, 334, 346





Fig. 3
Frequency $N_k^*/N_k$ of detecting a vertex of degree $k$ (top) and
proportion of discovered edges $\langle k^* \rangle / k$ (bottom) as a
function of the degree in the ER and WS graph models (curves for
$\epsilon = 0.1$, $0.5$ and $1$). The exploration setup considers $N_S = 2$
and increasing probing level $\epsilon$ obtained by progressively higher
density of targets $\rho_T$. The y axis is in log scale to allow a finer
resolution.



in the case of the models considered here. In Fig. 2 we report the vertex
betweenness distribution for both the ER and WS models, confirming their
Poissonian distribution with an exponentially fast decaying tail. The vertex
and edge betweenness are likewise homogeneous quantities in these networks,
and their distributions are peaked around the average values $\bar{b}$ and
$\bar{b}_e$, respectively. These values can be considered as typical values,
and the betweenness distribution is narrowly distributed around them.
Moreover, on average, the betweenness is related to the degree of the
vertices through a function $b(k)$ that increases with the degree. On the
other hand, in homogeneous graphs the range of variation in the degree is
extremely limited, resulting in small variations of the betweenness values.
Finally, it must be noted that the degree and betweenness distributions do
not exhibit a pronounced scaling with the size of the network because of the
intrinsic exponential cut-off.

Since a large majority of vertices and edges will have a betweenness very
close to the average value, we can use Eqs. (10) and (14) to estimate the
order of magnitude of probes that allows a fair sampling of the graph.
Indeed, both $\langle \pi_{i,j} \rangle$ and $\langle \pi_i \rangle$ tend to
1 if $\epsilon \gg \max\left[\bar{b}^{-1}, \bar{b}_e^{-1}\right]$. In this
limit all edges and vertices will have a probability of being discovered
very close to one.

At lower values of $\epsilon$, obtained by varying $\rho_T$ and $N_S$, the
underlying graph is only partially discovered. We first studied the behavior
of the fraction $N_k^*/N_k$ of discovered vertices of degree $k$, where
$N_k$ is the total number of vertices of degree $k$ in the underlying graph,
and the fraction of discovered edges $\langle k^* \rangle / k$ in vertices
of degree $k$. In Fig. 3 we report the behavior of these quantities as a
function of $k$ for both the ER and WS models. The fraction $N_k^*/N_k$
naturally increases with the density of targets and sources, and it
increases slightly for larger degrees. The latter behavior can be easily
understood by noticing that vertices with larger degree have on average a
larger betweenness $b(k)$. By using Eq. (14) we have that
$N_k^*/N_k \sim 1 - \exp(-\epsilon b(k))$, obtaining the observed increase
at large $k$. On the other hand, the range of variation of degrees in
homogeneous graphs is very narrow, and only a large level of probing may
guarantee very large discovery probabilities. Similarly, the behavior of the
effective discovered degree can be understood by looking at Eq. (16),
stating that $\langle k^* \rangle / k \sim \epsilon k^{-1} (1 + b(k))$.
Indeed, the initial decrease of $\langle k^* \rangle / k$ is finally
compensated by the increase of $b(k)$.

Fig. 4
Cumulative degree distribution of the sampled ER graph with $\bar{k} = 20$
for USP probes. The figure on the left shows sampled distributions obtained
with $N_S = 2$ and varying density of targets $\rho_T$. In the inset we
report the peculiar case $N_S = 1$ that provides an apparent power-law
behavior with exponent $-1$ at all values of $\rho_T$. The inset is in
lin-log scale to show the logarithmic behavior of the corresponding
cumulative distribution. The right figure shows sampled distributions
obtained with $\rho_T = 0.1$ and varying number of sources $N_S$. The solid
line is the degree distribution of the underlying graph.
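The two mean-field expressions quoted above are easy to evaluate numerically. The sketch below is illustrative only: the function names are hypothetical, the betweenness values fed to them are made up for the example, and capping the degree ratio at 1 is an added assumption (the approximate expression itself is not bounded).

```python
import math

def vertex_detection_fraction(eps, b_k):
    """Estimate N*_k / N_k ~ 1 - exp(-eps * b(k))."""
    return 1.0 - math.exp(-eps * b_k)

def discovered_degree_fraction(eps, k, b_k):
    """Estimate <k*>/k ~ eps * (1 + b(k)) / k.

    Assumption: the result is capped at 1, since a vertex cannot show
    more neighbours in the sampled graph than it has in the original.
    """
    return min(1.0, eps * (1.0 + b_k) / k)

# Hypothetical betweenness values, used only to show the trend with eps.
for b_k in (0.5, 2.0, 10.0):
    row = [round(vertex_detection_fraction(eps, b_k), 3) for eps in (0.1, 0.5, 1.0)]
    print(b_k, row)
```

The detection fraction grows monotonically with both the probing level and the betweenness, which is the qualitative trend described for the ER and WS models.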

A very important quantity in the study of the statistical accuracy of the
sampled graph is the degree distribution. In Fig. 4 we show the cumulative
degree distribution $P_c(k^* > k)$ of the sampled graph defined by the ER
model for increasing density of targets and sources. Sampled distributions
only approximate the genuine distribution; however, for $N_S \geq 2$ they
are far from true heavy-tail distributions at any appreciable level of
probing. Indeed, the distribution generally runs over a small range of
degrees, with a cut-off that sets in at the average degree $\bar{k}$ of the
underlying graph. In order to stretch the distribution range, homogeneous
graphs with very large average degree $\bar{k}$ must be considered; however,
other distinctive spurious effects appear in this case. In particular, since
the best sampling occurs around the high degree values, the distributions
develop peaks that show up in the cumulative distribution as plateaus (see
Fig. 6). The very same behavior is obtained in the case of the WS model (see
Fig. 5). Finally, in the case of the RSP and ASP models, we observe that the
obtained distributions are closer to the real one since they allow a larger
number of discoveries.

Fig. 5
Cumulative degree distribution of the sampled WS graph for USP probes. The
left figure shows sampled distributions obtained with $N_S = 2$ and varying
density of targets $\rho_T$. The figure on the right shows sampled
distributions obtained with $\rho_T = 0.1$ and varying number of sources
$N_S$. The solid line is the degree distribution of the underlying graph.
The inset shows the logarithmic behavior of the cumulative distribution for
$N_S = 1$ and $\rho_T = 1$.

Fig. 6
Cumulative degree distribution of the sampled ER graph with
$\bar{k} = 100$. Left figure: $N_S = 2$ and various values of $\rho_T$. The
inset corresponds to $N_S = 1$ as in Fig. 4. Right figure: $\rho_T = 0.1$
and various values of $N_S$. In these cases the distribution shows the
distinctive presence of plateaus corresponding to the peaks induced by the
sampling process.

Only in the peculiar case of $N_S = 1$ is an apparent scale-free behavior
with slope $-1$ observed for all target densities $\rho_T$, as analytically
shown by Clauset and Moore [8]. Also in this case, the distribution cut-off
is consistently determined by the average degree $\bar{k}$. It is worth
noting that the experimental setup with a single source is a limit case
corresponding to a highly asymmetric probing process; it is therefore badly,
if at all, captured by our statistical analysis, which assumes homogeneous
deployment.

The present analysis shows that in order to obtain a sampled graph with
apparent scale-free behavior on a degree range varying over $n$ orders of
magnitude we would need the very peculiar sampling of a homogeneous
underlying graph with an average degree $\bar{k} \simeq 10^n$; a rather
unrealistic situation in the Internet and many other information systems
where $n > 2$.

C. Sampling scale-free graphs 

In this section, we extend the analysis made for homogeneous graphs to the
case of highly heterogeneous scale-free graphs. We consider the
Barabási-Albert (BA) and the Dorogovtsev, Mendes and Samukhin (DMS) graph
models defined in Section V-A. We have used networks of size $N = 10^4$ with
$\bar{k} = 8$ for BA and $\bar{k} = 4$ for DMS, and averaged each
measurement over 10 realizations. Both models have a scale-free distribution
$P(k) \sim k^{-\gamma}$ with $\gamma \simeq 3$. Moreover, the DMS model is
highly clustered with an average $C \simeq 0.74$. The average degree of both
models is well defined; however, the degree distribution is heavy-tailed
with fluctuations diverging logarithmically with the graph size. This
implies that $\bar{k}$ is not a typical value in the network and there is an
appreciable probability of finding vertices with very high degree.
Analogously, the betweenness distribution is heavy-tailed, allowing for an
appreciable fraction of vertices and edges with very high betweenness [30].
In particular it is possible to show (see Fig. 7) that in scale-free graphs
the site betweenness is related to the vertex degree as
$b(k) \sim k^{\beta}$, where $\beta$ is an exponent depending on the model
[30]. Since in heavy-tailed degree distributions the allowed degree varies
over several orders of magnitude, the same will occur for the betweenness
values, as shown in Fig. 7. In addition, as customary for scale-free graphs,
the betweenness distribution extends over a range of values that increases
with the size of the network: in principle it extends up to infinity in an
infinite network.

Fig. 7
Cumulative distribution of the average node betweenness $b$ (top) and its
behavior as a function of the degree $k$ (bottom) in the BA and DMS graph
models. The plot is in log-log scale.

In such a situation, even in the case of small $\epsilon$, vertices whose
betweenness is large enough ($b(k)\epsilon \gg 1$) have
$\langle \pi_i \rangle \simeq 1$. Therefore all vertices with degree
$k \gg \epsilon^{-1/\beta}$ will be detected with probability one. This is
clearly visible in Fig. 8, where the discovery probability $N_k^*/N_k$ of
vertices with degree $k$ saturates to one for large degree values.
Consistently, the degree value at which the curve saturates decreases with
increasing $\epsilon$. A similar effect appears in the measurements
concerning $\langle k^* \rangle / k$. After an initial decay (see Fig. 8)
the effective discovered degree increases with the degree of the vertices.
This qualitative feature is captured by Eq. (16), which gives
$\langle k^* \rangle / k \simeq \epsilon k^{-1}(1 + b(k))$. After an initial
decay the term $k^{-1} b(k) \sim k^{\beta - 1}$ takes over and the effective
discovered degree approaches the real degree $k$.
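The scaling argument of this paragraph can be written out step by step. The lines below are a sketch that drops all prefactors and assumes the mean-field form $1 - e^{-\epsilon b(k)}$ for the vertex detection probability together with $b(k) \sim k^{\beta}$, $\beta > 0$:

```latex
\begin{align}
\langle \pi_i \rangle &\simeq 1 - e^{-\epsilon\, b(k)},
  \qquad b(k) \sim k^{\beta}, \\
\epsilon\, b(k) \gg 1
  &\iff k^{\beta} \gg \epsilon^{-1}
  \iff k \gg \epsilon^{-1/\beta}, \\
\frac{\langle k^* \rangle}{k}
  &\simeq \frac{\epsilon}{k}\,\bigl(1 + b(k)\bigr)
  \sim \epsilon \left( k^{-1} + k^{\beta - 1} \right).
\end{align}
```

The first term of the last line decays with $k$ while the second grows only if $\beta > 1$, which is the condition implicitly required for the effective discovered degree to recover at large $k$.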

It is evident from the previous discussions that in scale-free graphs,
vertices with high degree are efficiently sampled, with an effective
measured degree that is rather close to the real one. This means that the
degree distribution tail is fairly well sampled while deviations should be
expected at lower degree values. This is indeed what we observe in numerical
experiments on BA and DMS graphs (see Figs. 9 and 10). Although both
underlying graphs have a small average degree, the observed degree
distribution spans more than two orders of magnitude. The distribution tail
is fairly well reproduced even at rather small values of $\epsilon$. The
data clearly show that the low degree regime is instead under-sampled,
providing an apparent change in the exponent of the degree distribution.
This effect has also been noticed by Petermann and De Los Rios in the case
of single source experiments [9].

Fig. 8
Frequency $N_k^*/N_k$ of detecting a vertex of degree $k$ (top) and
proportion of discovered edges $\langle k^* \rangle / k$ (bottom) as a
function of the degree in the BA and DMS graph models. The exploration setup
considers $N_S = 2$ and increasing probing level $\epsilon$ obtained by
progressively higher density of targets $\rho_T$. The plot is in log-log
scale to allow a finer resolution and account for the wide variation of
degree in scale-free graphs.

The present analysis points out that graphs with heavy-tailed degree
distribution allow a better qualitative representation of their statistical
features in sampling experiments. Indeed, the most important properties of
these graphs are related to the heavy-tail part of the statistical
distributions, which is well discriminated by the traceroute-like
exploration.

Fig. 9
Cumulative degree distribution of the sampled BA graph for USP probes. The
top figure shows sampled distributions obtained with $N_S = 2$ and varying
density of targets $\rho_T$. The figure on the bottom shows sampled
distributions obtained with $\rho_T = 0.1$ and varying number of sources
$N_S$. The solid line is the degree distribution of the underlying graph.

Fig. 10
Cumulative degree distribution of the sampled DMS graph for USP probes. The
top figure shows sampled distributions obtained with $N_S = 2$ and varying
density of targets $\rho_T$. The figure on the bottom shows sampled
distributions obtained with $\rho_T = 0.1$ and varying number of sources
$N_S$. The solid line is the degree distribution of the underlying graph.



VI. Optimization of mapping strategies 

In the previous sections we have shown that it is possible to have a general
qualitative understanding of the efficiency of network exploration and the
induced biases on the statistical properties. The quantitative analysis of
the sampling strategies, however, is a much harder task that calls for a
detailed study of the discovered proportion of the underlying graph and the
precise deployment of sources and targets. In this perspective, very
important quantities are the fractions $N^*/N$ and $E^*/E$ of vertices and
edges discovered in the sampled graph, respectively. Unfortunately, the
mean-field approximation breaks down when we aim at a quantitative
representation of the results. The neglected correlations are in fact very
important for the precise estimate of the various quantities of interest.
For this reason we performed an extensive set of numerical explorations
aimed at a fine determination of the level of sampling achieved for
different experimental setups.
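A numerical exploration of this kind can be sketched as follows. This is a simplified illustration, not the authors' simulation code: the function names are hypothetical, the graph is assumed to be given as an adjacency dictionary, and each source keeps one fixed shortest path per target, chosen by breadth-first discovery order. That tie-breaking rule is an assumption and is not guaranteed to coincide with the USP, RSP or ASP probes used in the paper.

```python
import random
from collections import deque

def bfs_parents(adj, source):
    """Breadth-first search; parent[v] is v's predecessor on one shortest path."""
    parent = {source: None}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def sample_graph(adj, sources, targets):
    """Merge one shortest path per (source, target) pair into a sampled graph."""
    seen_nodes, seen_edges = set(), set()
    for s in sources:
        parent = bfs_parents(adj, s)
        for t in targets:
            if t not in parent:  # target unreachable from this source
                continue
            v = t
            seen_nodes.add(v)
            while parent[v] is not None:  # walk back from target to source
                u = parent[v]
                seen_edges.add((min(u, v), max(u, v)))
                seen_nodes.add(u)
                v = u
    return seen_nodes, seen_edges

def discovered_fractions(adj, n_sources, rho_t, seed=None):
    """Return (N*/N, E*/E) for randomly deployed sources and targets."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    sources = rng.sample(nodes, n_sources)
    targets = rng.sample(nodes, max(1, round(rho_t * len(nodes))))
    seen_nodes, seen_edges = sample_graph(adj, sources, targets)
    n_edges = sum(len(nb) for nb in adj.values()) // 2
    return len(seen_nodes) / len(nodes), len(seen_edges) / n_edges
```

Calling `discovered_fractions` repeatedly while raising either `n_sources` or `rho_t` gives the kind of curves discussed in this section, up to the difference in path selection noted above.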

In Fig. 11 we report the proportion of discovered edges in the numerical
exploration of the graph models defined previously for increasing level of
probing $\epsilon$. The level of probing is increased either by raising the
number of sources at fixed target density or by raising the target density
at fixed number of sources. As expected, both strategies are progressively
more efficient with increasing levels of probing. In scale-free graphs, it
is also possible to see that when the number of sources is $N_S \sim O(1)$,
increasing the number of targets achieves better sampling than increasing
the deployed sources. On the other hand, it is easy to see that shortest
path routed mapping is symmetric under the exchange of sources and targets.
This is confirmed by numerical experiments in which we use a very large
number of sources and a density of targets $\rho_T \sim O(1/N)$, where the
trends are opposite: increasing the number of sources achieves better
sampling than increasing the deployed targets.

This finding hints toward a behavior that is determined by the number of
sources and targets, $N_S$ and $N_T$. Any quantity is thus a function of
$N_S$ and $N_T$, or equivalently of $N_S$ and $\rho_T$. This point is
clearly illustrated in Figs. 12 and 13, where we report the behavior of
$E^*/E$ and $N^*/N$ at fixed $\epsilon$ and varying $N_S$ and $\rho_T$. The
curves exhibit a non-trivial behavior, and since we work at fixed
$\epsilon = \rho_T N_S$, any measured quantity can then be written as
$f(\rho_T, \epsilon/\rho_T) = g_\epsilon(\rho_T)$. Very interestingly, the
curves show a structure allowing for local minima and maxima in the
discovered portion of the underlying graph.

Fig. 11
Behavior of the fraction of discovered edges in explorations with increasing
$\epsilon$. For each underlying graph studied we report two curves
corresponding to larger $\epsilon$ achieved by increasing the target density
$\rho_T$ at $N_S = 2$ or the number of sources $N_S$ at $\rho_T = 0.05$.

Fig. 12
Behavior as a function of $\rho_T$ of the fraction of discovered edges in
explorations with fixed $\epsilon$ (here $\epsilon = 2$). Since
$\epsilon = \rho_T N_S$, the increase of $\rho_T$ corresponds to a lowering
of the number of sources $N_S$.

This feature can be explained by a simple symmetry argument. The model for
traceroute is symmetric under the exchange of sources and targets, which are
the end-points of shortest paths: an exploration with
$(N_T, N_S) = (N_1, N_2)$ is equivalent to one with
$(N_T, N_S) = (N_2, N_1)$. In other words, at fixed
$\epsilon = N_1 N_2 / N$, a density of targets $\rho_T = N_1/N$ is
equivalent to a density $\rho'_T = N_2/N$. Since $N_2 = \epsilon/\rho_T$, we
obtain that at constant $\epsilon$, experiments with $\rho_T$ and
$\rho'_T = \epsilon/(N \rho_T)$ are equivalent, so that by symmetry any
measured quantity obeys the equality

$$g_\epsilon(\rho_T) = g_\epsilon\!\left(\frac{\epsilon}{N \rho_T}\right).$$

This equation implies a symmetry point signaling the presence of a maximum
or a minimum at $\rho_T = \epsilon/(N \rho_T)$. We therefore expect the
occurrence of a symmetry in the graphs of Fig. 12 at
$\rho_T \simeq \sqrt{\epsilon/N}$. Indeed, the symmetry point is clearly
visible and in good quantitative agreement with the previous estimate in the
case of scale-free graphs. On the contrary, homogeneous underlying
topologies have a smooth behavior that makes the clear identification of the
symmetry point difficult. It must also be noticed that USP probes create a
certain level of correlations in the exploration that tends to hide the
complete symmetry of the curves.
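The location of the symmetry point follows from simple arithmetic, which the short sketch below makes explicit. The function names are hypothetical, and the network size used in the example is an illustrative assumption; only the value $\epsilon = 2$ is taken from the fixed-probing figures.

```python
import math

def mirrored_target_density(rho_t, eps, n_nodes):
    """Density rho'_T = eps / (N * rho_T) obtained by swapping sources and targets."""
    return eps / (n_nodes * rho_t)

def symmetry_point(eps, n_nodes):
    """Fixed point of the swap: rho_T = sqrt(eps / N)."""
    return math.sqrt(eps / n_nodes)

eps = 2.0          # probing level used in the fixed-eps figures
n_nodes = 10_000   # assumed network size, for illustration only
rho_star = symmetry_point(eps, n_nodes)
print(rho_star, rho_star * n_nodes)  # density and number of targets at the fixed point
```

At the fixed point the numbers of sources and targets coincide, $N_S = N_T = \sqrt{\epsilon N}$, so the deployment is perfectly balanced there.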

The previous results imply that at fixed levels of probing $\epsilon$
different proportions of sources and targets may achieve different levels of
sampling. This hints at the search for optimal strategies in the relative
deployment of sources and targets. The picture, however, is more complicated
if we look at other quantities in the sampled graph. In Fig. 14 we show the
behavior at fixed $\epsilon$ of the average degree $\bar{k}^*$ measured in
sampled graphs, normalized with the actual average degree $\bar{k}$ of the
underlying graph, as a function of $\rho_T$. The plot shows also in this
case a symmetric structure with a maximum at the symmetry point. By
comparing Fig. 14 with Fig. 12 we notice that the symmetry point is of a
different nature for different quantities. Hence, where we have a minimum in
the fraction of discovered edges, we have the best estimate of the average
degree. This implies that at the symmetry point the exploration discovers
fewer edges than in other setups while achieving a more efficient sampling
of the effective degree for the discovered vertices. A similar problem is
found by studying the behavior of the ratio $C^*/C$ between the clustering
coefficients of the sampled and the underlying graphs. We studied this
quantity for the WS and DMS models, which have a high clustering level. Also
in these cases, as shown in Fig. 15, the best level of sampling is achieved
at particular values of $\epsilon$ and $N_S$ that conflict with the best
sampling of other quantities.

The evidence presented in this section hints at a possible optimization of
the sampling strategy. The optimal solution, however, appears as a trade-off
between the different levels of efficiency achieved in competing ranges of
the experimental setup. In this respect, a detailed and quantitative
investigation of the various quantities of interest in different
experimental setups is needed in order to pinpoint the most efficient
deployment of source-target pairs depending on the underlying graph
topology.

Fig. 13
Behavior as a function of $\rho_T$ of the fraction of discovered nodes in
explorations with fixed $\epsilon$

Fig. 14
Behavior as a function of $\rho_T$ of the normalized average degree
$\bar{k}^*/\bar{k}$ for a fixed probing level $\epsilon$ (here
$\epsilon = 2$).

Fig. 15
Behavior as a function of $\rho_T$ of the normalized average clustering
coefficient $C^*/C$ for a fixed probing level $\epsilon$ (here
$\epsilon = 2$).

VII. Conclusions and outlook 

The rationalization of the exploration biases at the statistical level
provides a general interpretative framework for the results obtained from
the numerical experiments on graph models. The sampled graph clearly
distinguishes the two situations defined by homogeneous and heavy-tailed
topologies, respectively. This is due to the exploration process, which
statistically focuses on high betweenness nodes, thus providing a very
accurate sampling of the distribution tail. In graphs with heavy tails, such
as scale-free networks, the main topological features are therefore easily
discriminated since the relevant statistical information is encapsulated in
the degree distribution tail, which is fairly well captured. Quite
surprisingly, the sampling of homogeneous graphs appears more cumbersome
than that of heavy-tailed graphs. Dramatic effects such as the existence of
apparent power laws, however, are found only in very peculiar cases. In
general, exploration strategies provide sampled distributions with enough
signatures to distinguish at the statistical level between graphs with
different topologies.

This evidence might be relevant in the discussion of real data from Internet
mapping projects. Indeed, data available so far indicate the presence of a
heavy-tailed degree distribution both at the router and AS level. In the
light of the present discussion, it is very unlikely that this feature is
just an artifact of the mapping strategies. The upper degree cut-off at the
router and AS level runs up to $10^2$ and $10^3$, respectively. A
homogeneous graph should have an average degree comparable to the measured
cut-off, and this is hardly conceivable in a realistic perspective (for
instance, it would require that nine routers out of ten have more than 100
links to other routers). In addition, the major part of mapping projects are
multi-source, a feature that we have shown to readily wash out the presence
of spurious power-law behavior. On the contrary, power-law tails are easily
sampled with particular accuracy for the large degree part, generally at all
probing levels. This makes it very plausible, and a natural consequence,
that the heavy-tail behavior observed in real mapping experiments is a
genuine feature of the Internet.

On the other hand, it is important to stress that while at the qualitative
level the sampled graphs allow a good discrimination of the statistical
properties, at the quantitative level they might exhibit considerable
deviations from the true values of quantities such as the average degree,
the distribution exponent and the clustering properties. In this respect, it
is of major importance to define strategies that optimize the estimate of
the various parameters and quantities of the underlying graph. In this paper
we have shown that the proportion of sources and targets may have an impact
on the accuracy of the measurements even if the total number of probes
imposed on the system is the same. For instance, the deployment of a highly
distributed infrastructure of sources probing a limited number of targets
may be as efficient as a few very powerful sources probing a large fraction
of the addressable space. The optimization of large network sampling is
therefore an open problem that calls for further work aimed at a more
quantitative assessment of the mapping strategies on both the analytic and
the numerical side.